Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning
The quintessential model-based reinforcement-learning agent iteratively
refines its estimates or prior beliefs about the true underlying model of the
environment. Recent empirical successes in model-based reinforcement learning
with function approximation, however, eschew the true model in favor of a
surrogate that, while ignoring various facets of the environment, still
facilitates effective planning over behaviors. Recently formalized as the value
equivalence principle, this algorithmic technique is perhaps unavoidable as
real-world reinforcement learning demands consideration of a simple,
computationally-bounded agent interacting with an overwhelmingly complex
environment, whose underlying dynamics likely exceed the agent's capacity for
representation. In this work, we consider the scenario where agent limitations
may entirely preclude identifying an exactly value-equivalent model,
immediately giving rise to a trade-off between identifying a model that is
simple enough to learn and one that incurs only bounded sub-optimality. To address
this problem, we introduce an algorithm that, using rate-distortion theory,
iteratively computes an approximately-value-equivalent, lossy compression of
the environment which an agent may feasibly target in lieu of the true model.
We prove an information-theoretic, Bayesian regret bound for our algorithm that
holds for any finite-horizon, episodic sequential decision-making problem.
Crucially, our regret bound can be expressed in one of two possible forms,
providing a performance guarantee for finding either the simplest model that
achieves a desired sub-optimality gap or, alternatively, the best model given a
limit on agent capacity.
Comment: Accepted to Neural Information Processing Systems (NeurIPS) 2022.
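The abstract does not specify an implementation, but the rate-distortion computation it invokes can be illustrated with a standard Blahut-Arimoto iteration over a finite set of candidate surrogate models. The sketch below is a minimal illustration under assumed inputs: the prior over environments, the value-based distortion matrix, and the trade-off parameter beta are hypothetical placeholders rather than quantities taken from the paper.

```python
import numpy as np

def blahut_arimoto(prior, distortion, beta, n_iters=200, tol=1e-9):
    """Blahut-Arimoto iteration for a rate-distortion trade-off (illustrative).

    prior      : (N,) distribution over candidate true environments.
    distortion : (N, M) matrix; distortion[i, j] is an assumed value loss from
                 planning with surrogate model j when the environment is i
                 (an approximate value-equivalence gap).
    beta       : Lagrange multiplier trading off rate against distortion.
    Returns the channel q(model j | environment i) defining the lossy
    compression, plus its rate (nats) and expected distortion.
    """
    n, m = distortion.shape
    q_marginal = np.full(m, 1.0 / m)               # q(j): marginal over surrogate models
    for _ in range(n_iters):
        # Update the channel q(j | i) proportional to q(j) * exp(-beta * d(i, j)).
        log_q = np.log(q_marginal)[None, :] - beta * distortion
        log_q -= log_q.max(axis=1, keepdims=True)
        q_cond = np.exp(log_q)
        q_cond /= q_cond.sum(axis=1, keepdims=True)
        # Update the marginal q(j) = sum_i p(i) q(j | i).
        new_marginal = prior @ q_cond
        if np.abs(new_marginal - q_marginal).max() < tol:
            q_marginal = new_marginal
            break
        q_marginal = new_marginal
    # Rate I(environment; model) and expected distortion under the channel.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(q_cond > 0, q_cond / q_marginal[None, :], 1.0)
        rate = float(np.sum(prior[:, None] * q_cond * np.log(ratio)))
    expected_distortion = float(np.sum(prior[:, None] * q_cond * distortion))
    return q_cond, rate, expected_distortion
```

Sweeping beta traces out a rate-distortion curve, which loosely mirrors the two forms of the stated regret bound: fix a distortion (sub-optimality) target and read off the required rate, or fix a rate (capacity) limit and read off the achievable distortion.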
Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning
Oftentimes, environments for sequential decision-making problems can be quite
sparse in the provision of evaluative feedback to guide reinforcement-learning
agents. In the extreme case, long trajectories of behavior are merely
punctuated with a single terminal feedback signal, engendering a significant
temporal delay between the observation of non-trivial reward and the individual
steps of behavior culpable for eliciting such feedback. Coping with such a
credit assignment challenge is one of the hallmark characteristics of
reinforcement learning and, in this work, we capitalize on existing
importance-sampling ratio estimation techniques for off-policy evaluation to
drastically improve the handling of credit assignment with policy-gradient
methods. While the use of so-called hindsight policies offers a principled
mechanism for reweighting on-policy data by saliency to the observed trajectory
return, naively applying importance sampling results in unstable or excessively
lagged learning. In contrast, our hindsight distribution correction facilitates
stable, efficient learning across a broad range of environments where credit
assignment plagues baseline methods.
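As a rough illustration of the idea, the sketch below reweights a vanilla policy-gradient loss by an estimated hindsight density ratio. The `policy` and `ratio_model` interfaces are hypothetical stand-ins: `ratio_model` represents whatever distribution-correction estimator supplies the ratio between the return-conditioned (hindsight) action distribution and the behavior policy, which the paper obtains with DICE-style estimation rather than naive importance sampling.

```python
import torch

def hindsight_weighted_pg_loss(policy, ratio_model, states, actions, trajectory_return):
    """Policy-gradient loss reweighted by an estimated hindsight ratio (sketch).

    Assumed interfaces: `policy(states)` returns a torch.distributions object,
    and `ratio_model(states, actions, trajectory_return)` returns per-step
    estimates of pi(a_t | s_t, Z) / pi(a_t | s_t) for observed return Z.
    """
    log_probs = policy(states).log_prob(actions)          # log pi(a_t | s_t)
    with torch.no_grad():
        # Per-step saliency of each action to the observed trajectory return.
        weights = ratio_model(states, actions, trajectory_return)
    # Each step is credited in proportion to its estimated hindsight weight.
    return -(weights * log_probs * trajectory_return).mean()
```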
Inclusive Artificial Intelligence
Prevailing methods for assessing and comparing generative AIs incentivize
responses that serve a hypothetical representative individual. Evaluating
models in these terms presumes homogeneous preferences across the population
and engenders selection of agglomerative AIs, which fail to represent the
diverse range of interests across individuals. We propose an alternative
evaluation method that instead prioritizes inclusive AIs, which provably retain
the requisite knowledge not only for subsequent response customization to
particular segments of the population but also for utility-maximizing
decisions.
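A toy numerical contrast may help fix the distinction (this illustrates the framing only, not the paper's evaluation protocol): an agglomerative score judges a model by the single response that best serves an averaged "representative" preference, whereas an inclusive score asks whether the model retains a good response for every preference segment. The utilities and segment weights below are invented for the example.

```python
import numpy as np

def representative_score(utilities, segment_weights):
    """Score via a single aggregate 'representative' user: judge responses
    against the population-average preference and keep the best one."""
    aggregate_pref = segment_weights @ utilities       # average utility per response
    return float(aggregate_pref.max())

def inclusive_score(utilities, segment_weights):
    """Score by letting each preference segment receive its own best response,
    then averaging; high only if the model serves every segment well."""
    per_segment_best = utilities.max(axis=1)           # best response per segment
    return float(segment_weights @ per_segment_best)

# utilities[i, j]: hypothetical utility of response j for preference segment i.
utilities = np.array([[1.0, 0.2, 0.1],
                      [0.1, 0.9, 0.3],
                      [0.0, 0.1, 0.8]])
segment_weights = np.array([0.5, 0.3, 0.2])
print(representative_score(utilities, segment_weights),
      inclusive_score(utilities, segment_weights))
```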
Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models
A centerpiece of the ever-popular reinforcement learning from human feedback
(RLHF) approach to fine-tuning autoregressive language models is the explicit
training of a reward model to emulate human feedback, distinct from the
language model itself. This reward model is then coupled with policy-gradient
methods to dramatically improve the alignment between language model outputs
and desired responses. In this work, we adopt a novel perspective wherein a
pre-trained language model is itself simultaneously a policy, reward function,
and transition function. An immediate consequence of this is that reward
learning and language model fine-tuning can be performed jointly and directly,
without requiring any further downstream policy optimization. While this
perspective does indeed break the traditional agent-environment interface, we
nevertheless maintain that there can be enormous statistical benefits afforded
by bringing to bear traditional algorithmic concepts from reinforcement
learning. Our experiments demonstrate one concrete instance of this through
efficient exploration based on the representation and resolution of epistemic
uncertainty. In order to illustrate these ideas in a transparent manner, we
restrict attention to a simple didactic data-generating process and leave
extension to systems of practical scale for future work.
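One common way to represent and resolve epistemic uncertainty, of the kind the abstract alludes to, is an ensemble of reward heads combined with Thompson-sampling-style selection. The sketch below is purely illustrative and not the paper's agent; the ensemble size, the frozen-feature interface, and the selection rule are all assumptions.

```python
import torch

class EnsembleRewardHeads(torch.nn.Module):
    """K independent linear reward heads on top of frozen LM features (sketch).
    Disagreement between heads serves as a proxy for epistemic uncertainty."""
    def __init__(self, hidden_dim, k=10):
        super().__init__()
        self.heads = torch.nn.ModuleList(torch.nn.Linear(hidden_dim, 1) for _ in range(k))

    def forward(self, features):                 # features: (batch, hidden_dim)
        return torch.cat([h(features) for h in self.heads], dim=-1)   # (batch, k)

def thompson_select(features, reward_heads):
    """Pick a candidate completion by sampling one head uniformly at random
    and maximizing its predicted reward (a simple Thompson-sampling rule)."""
    with torch.no_grad():
        scores = reward_heads(features)                       # (batch, k)
        head = torch.randint(scores.shape[-1], (1,)).item()
        return scores[:, head].argmax().item()
```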
A Tale of Two DRAGGNs: A Hybrid Approach for Interpreting Action-Oriented and Goal-Oriented Instructions
Robots operating alongside humans in diverse, stochastic environments must be
able to accurately interpret natural language commands. These instructions
often fall into one of two categories: those that specify a goal condition or
target state, and those that specify explicit actions, or how to perform a
given task. Recent approaches have used reward functions as a semantic
representation of goal-based commands, which allows for the use of a
state-of-the-art planner to find a policy for the given task. However, these
reward functions cannot be directly used to represent action-oriented commands.
We introduce a new hybrid approach, the Deep Recurrent Action-Goal Grounding
Network (DRAGGN), for task grounding and execution that handles natural
language from either category as input, and generalizes to unseen environments.
Our robot-simulation results demonstrate that a system successfully
interpreting both goal-oriented and action-oriented task specifications brings
us closer to robust natural language understanding for human-robot interaction.
Comment: Accepted at the 1st Workshop on Language Grounding for Robotics at ACL 2017.
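For orientation only, the sketch below shows one plausible shape of such a hybrid grounding network: a recurrent encoder of the instruction feeding a gate that chooses between a goal head (grounding to a reward-function template) and an action head (grounding to explicit actions). The layer sizes and output vocabularies are hypothetical, not the published DRAGGN architecture.

```python
import torch
import torch.nn as nn

class HybridGroundingSketch(nn.Module):
    """Illustrative two-headed grounding network (not the paper's exact model)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_goals, n_actions):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.type_gate = nn.Linear(hidden_dim, 2)             # goal- vs. action-oriented
        self.goal_head = nn.Linear(hidden_dim, n_goals)       # grounds to a reward-function template
        self.action_head = nn.Linear(hidden_dim, n_actions)   # grounds to explicit actions

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        _, h = self.encoder(self.embed(token_ids))
        h = h.squeeze(0)                         # final hidden state: (batch, hidden_dim)
        return self.type_gate(h), self.goal_head(h), self.action_head(h)
```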